Improving speech recognition and keyword search for low resource languages using web data
نویسندگان
چکیده
We describe the use of text data scraped from the web to augment language models for Automatic Speech Recognition and Keyword Search for Low Resource Languages. We scrape text from multiple genres including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant for language modeling of conversational telephone speech and obtain large reductions in out-of-vocabulary keywords. Furthermore, we show that the web data can improve Term Error Rate Performance by 3.8% absolute and Maximum Term-Weighted Value in Keyword Search by 0.0076-0.1059 absolute points. Much of the gain comes from the reduction of out-of-vocabulary items.
منابع مشابه
Babler - Data Collection from the Web to Support Speech Recognition and Keyword Search
We describe a system to collect web data for Low Resource Languages, to augment language model training data for Automatic Speech Recognition (ASR) and keyword search by reducing the Out-ofVocabulary (OOV) rates – words in the test set that did not appear in the training set for ASR. We test this system on seven Low Resource Languages from the IARPA Babel Program: Paraguayan Guarani, Igbo, Amha...
متن کاملA comparison of multiple methods for rescoring keyword search lists for low resource languages
We review the performance of a new two-stage cascaded machine learning approach for rescoring keyword search output for low resource languages. In the first stage Confusion Networks (CNs) are rescored for improved Automatic Speech Recognition (ASR) by reranking the arcs of each confusion bin. In the second stage we generate keyword search hypotheses from the rescored ASR output and rescore them...
متن کاملSpoken Keyword Rescoring and Document Retrieval for Low-resource Languages
For languages that have adequate data for automatic speech recognition (ASR), many keyword search(KWS) and document retrieval(SDR) systems have been developed with near-optimal performance. However, lacking of sufficient training data to produce high accuracy transcript, identification and retrieval of queries in speech data from low-resources languages remains challenging. To compensate for th...
متن کاملDeveloping Keyword Search under the Iarpa Babel Program
Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. Keyword search (KWS), also known as spoken term detection (STD), is a speech processing task in which the goal is to find all the occurrences of a textual “keyword”, a sequence of one or more words, in a large corpus of speech data. In 2006, the U.S. National Institute of S...
متن کاملStrategies for rescoring keyword search results using word-burst and acoustic features
The identification of keyword queries in speech data from lowresources languages poses a challenge for current methods as speech recognition algorithms lack sufficient training data to produce high accuracy transcript. To compensate for these shortcomings, we extract signals from the data that are useful in keyword identification but are not being used by the speech recognizer. These signals ta...
متن کامل